TIMELINE: Exhaustive Annotation of Temporal Relations Supporting the Automatic Ordering of Events in News Articles
Temporal relation extraction models have thus far been hindered by a number
of issues in existing temporal relation-annotated news datasets, including: (1)
low inter-annotator agreement due to the lack of specificity of their
annotation guidelines in terms of what counts as a temporal relation; (2) the
exclusion of long-distance relations within a given document (those spanning
across different paragraphs); and (3) the exclusion of events that are not
centred on verbs. This paper aims to alleviate these issues by presenting a new
annotation scheme that clearly defines the criteria based on which temporal
relations should be annotated. Additionally, the scheme includes events even if
they are not expressed as verbs (e.g., nominalised events). Furthermore, we
propose a method for annotating all temporal relations -- including
long-distance ones -- which automates the process, hence reducing time and
manual effort on the part of annotators. The result is a new dataset, the
TIMELINE corpus, in which improved inter-annotator agreement was obtained, in
comparison with previously reported temporal relation datasets. We report the
results of training and evaluating baseline temporal relation extraction models
on the new corpus, and compare them with results obtained on the widely used
MATRES corpus.
Comment: Accepted for publication in EMNLP 2023: 13 pages, 3 figures and 14 tables
Learning to Play Chess from Textbooks (LEAP): a Corpus for Evaluating Chess Moves based on Sentiment Analysis
Learning chess strategies has been investigated widely, with most studies
focussing on learning from previous games using search algorithms. Chess
textbooks encapsulate grandmaster knowledge, explain playing strategies and
require a smaller search space compared to traditional chess agents. This paper
examines chess textbooks as a new knowledge source for enabling machines to
learn how to play chess -- a resource that has not been explored previously. We
developed the LEAP corpus, a new heterogeneous dataset, the first of its kind, with
structured (chess move notations and board states) and unstructured data
(textual descriptions) collected from a chess textbook containing 1164
sentences discussing strategic moves from 91 games. We first labelled the
sentences based on their relevance, i.e., whether they discuss a move.
Each relevant sentence was then labelled according to its sentiment towards the
described move. We performed empirical experiments that assess the performance
of various transformer-based baseline models for sentiment analysis. Our
results demonstrate the feasibility of employing transformer-based sentiment
analysis models for evaluating chess moves, with the best performing model
obtaining a weighted micro F_1 score of 68%. Finally, we synthesised the LEAP
corpus to create a larger dataset, which can help address the
limited textual resources in the chess domain.
Comment: 27 pages, 10 figures, 9 tables
Semantics Altering Modifications for Evaluating Comprehension in Machine Reading
Advances in NLP have yielded impressive results for the task of machine
reading comprehension (MRC), with approaches having been reported to achieve
performance comparable to that of humans. In this paper, we investigate whether
state-of-the-art MRC models are able to correctly process Semantics Altering
Modifications (SAM): linguistically-motivated phenomena that alter the
semantics of a sentence while preserving most of its lexical surface form. We
present a method to automatically generate and align challenge sets featuring
original and altered examples. We further propose a novel evaluation
methodology to correctly assess the capability of MRC systems to process these
examples independent of the data they were optimised on, by discounting for
effects introduced by domain shift. In a large-scale empirical study, we apply
the methodology in order to evaluate extractive MRC models with regard to their
capability to correctly process SAM-enriched data. We comprehensively cover 12
different state-of-the-art neural architecture configurations and four training
datasets and find that -- despite their well-known remarkable performance --
optimised models consistently struggle to correctly process semantically
altered data.
Comment: AAAI 2021, final version. 7 pages content + 2 pages references
Towards End-User Development for IoT: A Case Study on Semantic Parsing of Cooking Recipes for Programming Kitchen Devices
Semantic parsing of user-generated instructional text, as a way of enabling
end-users to program the Internet of Things (IoT), is an underexplored area. In
this study, we provide a unique annotated corpus which aims to support the
transformation of cooking recipe instructions to machine-understandable
commands for IoT devices in the kitchen. Each of these commands is a tuple
capturing the semantics of an instruction involving a kitchen device in terms
of "What", "Where", "Why" and "How". Based on this corpus, we developed machine
learning-based sequence labelling methods, namely conditional random fields
(CRF) and a neural network model, in order to parse recipe instructions and
extract our tuples of interest from them. Our results show that while it is
feasible to train semantic parsers based on our annotations, most
natural-language instructions are incomplete, and thus transforming them into
a formal meaning representation is not straightforward.
Comment: 8 pages, 1 figure, 2 tables. Work completed in January 202
Extracting granular information on habitats and reproductive conditions of Dipterocarps through pattern-based literature analysis
Lowland tropical rainforests in Southeast Asia, which are primarily composed of dipterocarp species, are among the most threatened ecosystems in the world. Belonging to the family Dipterocarpaceae, dipterocarps are economically and ecologically important due to their timber value as well as their contribution to wildlife habitat. The challenge in the restoration and rehabilitation of these dipterocarp forests lies in their complex reproduction patterns, i.e., supra-annual mass flowering events that may occur at irregular intervals of two to ten years, possibly synchronously across Asia. Understanding their regeneration to make plans for effective reforestation can be aided by providing access to a comprehensive database that contains long-term and wide-scale data on dipterocarps. The content of such a database can be enriched with literature-derived information on habitats and reproductive conditions of dipterocarps.
We aim to develop literature mining methods to automatically extract information relevant to the distribution and reproductive cycle of dipterocarps, in order to help predict the likelihood of their regeneration, and subsequently make informed decisions regarding species for reforestation. In previous work, we developed a machine learning-based named entity recognition (NER) model that automatically annotates entities relevant to species’ distribution, e.g., taxon names, geographic locations, temporal expressions, habitats, authorities, and names of herbaria. Furthermore, the species’ reproductive condition, e.g., whether it is sterile or in the state of producing fruit ("in fruit") or flower ("in flower"), was also automatically annotated to enable the derivation of phenological patterns. The model was trained on a manually annotated corpus of documents, e.g., scholarly articles and government agency reports.
In this work, we focus our efforts specifically on the extraction of relationships between habitats and their locations, and between reproductive conditions and temporal expressions. To this end, we have developed a syntactic pattern-based matching approach by building upon Grew (http://grew.fr/), a graph rewriting system for manipulating linguistic representations. For our purposes, we designed patterns that make use of syntactic dependencies, part-of-speech tags and named entity types (derived from NER results). When fed into Grew, these patterns were able to analyse sentences in scholarly articles by associating habitats with their geographic locations, and by determining a species’ reproductive condition at a specific point in time. The resulting relationships are then used to enrich the information contained in a database of dipterocarp occurrences. Such a resource will provide more comprehensive ecological data that could form the basis of more informed reforestation decisions.